Saliency-guided Adaptive Seeding for Supervoxel Segmentation
We propose a new saliency-guided method for generating supervoxels in 3D
space. Rather than using an evenly distributed spatial seeding procedure, our
method uses visual saliency to guide the process of supervoxel generation. This
results in densely distributed, small, and precise supervoxels in salient
regions, which often contain objects, and larger supervoxels in less salient
regions, which often correspond to background. Our approach substantially
improves the quality of the resulting supervoxel segmentation in terms of
boundary recall and under-segmentation error on publicly available benchmarks.
Comment: 6 pages, accepted to IROS201
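The abstract only describes the seeding strategy at a high level. As a rough illustration (not the authors' implementation), the sketch below places seeds with a density proportional to a given saliency volume; the saliency input, the mixing weight, and the sampling scheme are assumptions made for this example.

import numpy as np

def adaptive_seeds(saliency, n_seeds, rng=None):
    """Sample supervoxel seed locations guided by a 3D saliency volume.

    saliency : 3D array with values in [0, 1]; high values mark salient regions.
    Returns an (n_seeds, 3) array of (z, y, x) coordinates that are dense where
    saliency is high and sparse in low-saliency (background) regions.
    """
    rng = np.random.default_rng(rng)
    # Mix saliency with a small uniform term so background still gets coverage.
    p = 0.8 * saliency + 0.2
    p = p.ravel() / p.sum()
    idx = rng.choice(saliency.size, size=n_seeds, replace=False, p=p)
    return np.stack(np.unravel_index(idx, saliency.shape), axis=1)

# Toy usage: a synthetic volume whose centre cube is "salient".
vol = np.zeros((32, 32, 32))
vol[12:20, 12:20, 12:20] = 1.0
seeds = adaptive_seeds(vol, n_seeds=200, rng=0)
print(seeds.shape)  # (200, 3)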
Small, but important: Traffic light proposals for detecting small traffic lights and beyond
Traffic light detection is a challenging problem in the context of
self-driving cars and driver assistance systems. While most existing systems
produce good results on large traffic lights, detecting small and tiny ones is
often overlooked. A key problem here is the inherent downsampling in CNNs,
leading to low-resolution features for detection. To mitigate this problem, we
propose a new traffic light detection system, comprising a novel traffic light
proposal generator that utilizes findings from general object proposal
generation, fine-grained multi-scale features, and attention for efficient
processing. Moreover, we design a new detection head for classifying and
refining our proposals. We evaluate our system on three challenging, publicly
available datasets and compare it against six methods. The results show
substantial improvements on small and tiny traffic lights, as well as strong
results across all sizes of traffic lights.
Comment: Accepted at ICVS 202
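The proposal generator is only outlined in the abstract. The toy PyTorch module below sketches one way to fuse fine-grained multi-scale features with channel attention and predict an objectness map for small objects; the layer sizes, attention design, and class name are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalGenerator(nn.Module):
    """Toy proposal generator: fuses multi-scale CNN features at the finest
    resolution, applies channel attention, and predicts per-pixel objectness."""

    def __init__(self, channels=(64, 128, 256), fused=64):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, fused, 1) for c in channels)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(fused, fused, 1), nn.Sigmoid()
        )
        self.objectness = nn.Conv2d(fused, 1, 3, padding=1)

    def forward(self, feats):
        # Project every scale to a common width and upsample to the finest
        # resolution, so small objects keep high-resolution features.
        target = feats[0].shape[-2:]
        fused = sum(
            F.interpolate(lat(f), size=target, mode="bilinear", align_corners=False)
            for lat, f in zip(self.lateral, feats)
        )
        fused = fused * self.attn(fused)               # channel attention
        return torch.sigmoid(self.objectness(fused))   # proposal heat map

# Toy usage with random multi-scale features.
feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256), (64, 32, 16))]
print(ProposalGenerator()(feats).shape)  # torch.Size([1, 1, 64, 64])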
Audio-Visual Speech Enhancement with Score-Based Generative Models
This paper introduces an audio-visual speech enhancement system that
leverages score-based generative models, also known as diffusion models,
conditioned on visual information. In particular, we exploit audio-visual
embeddings obtained from a self-supervised learning model that has been
fine-tuned on lipreading. The layer-wise features of its transformer-based
encoder are aggregated, time-aligned, and incorporated into the noise
conditional score network. Experimental evaluations show that the proposed
audio-visual speech enhancement system yields improved speech quality and
reduces generative artifacts such as phonetic confusions with respect to the
audio-only equivalent. The latter is supported by the word error rate of a
downstream automatic speech recognition model, which decreases noticeably,
especially at low input signal-to-noise ratios.
Comment: Submitted to ITG Conference on Speech Communication
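The conditioning step (layer-wise aggregation, time alignment, incorporation into the score network) can be sketched as follows. This is a minimal illustration under assumed dimensions; the module name, feature sizes, and score network itself are placeholders rather than the authors' system.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConditioning(nn.Module):
    """Toy conditioning module: aggregates layer-wise transformer features with
    learned weights, time-aligns them to the audio frame rate, and returns a
    conditioning sequence for a score network."""

    def __init__(self, n_layers=12, feat_dim=768, cond_dim=128):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(feat_dim, cond_dim)

    def forward(self, layer_feats, n_frames):
        # layer_feats: (n_layers, T_video, feat_dim) from an AV encoder.
        w = torch.softmax(self.layer_weights, dim=0)
        agg = (w[:, None, None] * layer_feats).sum(0)          # (T_video, feat_dim)
        cond = self.proj(agg).transpose(0, 1).unsqueeze(0)     # (1, cond_dim, T_video)
        # Time-align visual features to the audio frame rate.
        cond = F.interpolate(cond, size=n_frames, mode="linear", align_corners=False)
        return cond.squeeze(0).transpose(0, 1)                 # (n_frames, cond_dim)

# Toy usage: 12 transformer layers, 25 video frames, 100 audio frames.
feats = torch.randn(12, 25, 768)
print(VisualConditioning()(feats, n_frames=100).shape)  # torch.Size([100, 128])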